Partitioned Narratives: Thick Mapping the 1947 Partition Archive
Introduction
The documentation below is the white paper for the essay: “Partitioned Narratives: Thick Mapping the 1947 Partition Archive.” It includes the R code and csv files necessary to reproduce the calculations. Some of the spatial manipulations were performed in QGIS 3.16 (Hannover). When possible, images and R code chunks have been provided for reproducibility. Some steps involved converting CSV files to GeoPackage files. As this is a common GIS workflow, it has been skipped.
Part 1: Priming the NER Extracted Data
Load packages
The following packages are necessary to run this script: tidyverse, tidygeocoder, tidytext, stringi, and htmlTable.
Load location data
The loaded csv file is a cleaned-up version of the one that results from scraping and running the data through NER. The cleaning process mostly involves removing false positives, consolidating similar locations (e.g. Bombay and Mumbai), and removing any corrupt data. This process also included coding the gender of the narrative and whether a person mentioned their occupation. Finally, for lesser-known or ambiguous locations we added the city and district to aid the geocoder.
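The consolidation step can be illustrated with a small lookup table. This is a minimal sketch rather than the cleaning script actually used: the `location_aliases` vector and the toy rows are hypothetical, and only the `locations` column name comes from the data.

```r
library(dplyr)

# Hypothetical alias table: variant or historical name -> canonical name
location_aliases <- c("Bombay" = "Mumbai", "Lyallpur" = "Faisalabad")

# Toy rows standing in for the NER output (not real data)
ner_sample <- tibble(
  name = c("Person A", "Person A", "Person B"),
  locations = c("Bombay", "Mumbai", "Lyallpur")
)

consolidated <- ner_sample %>%
  mutate(
    # Replace a location only when it appears in the alias table
    locations = if_else(
      locations %in% names(location_aliases),
      unname(location_aliases[locations]),
      locations
    )
  )
```

Keeping the aliases in a single named vector makes the manual decisions easy to review and extend.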
Reformat partition_df
The following procedure primes the data for analysis:
- Create an address field by uniting the location, city, and country fields
- Remove unnecessary strings from data fields
- Drop unnecessary columns
- Keep all distinct addresses by person name. This prevents double counting locations in a person’s account
partition_distinct_locations <- partition_df %>%
  group_by(name) %>%
  #create address field from locations, city, and country columns
  unite("address",
        locations:country,
        sep = ", ",
        na.rm = TRUE) %>%
  #remove unnecessary label text
  mutate(
    age = str_remove(age, "Age in 1947: "),
    migrated_from = str_remove(migrated_from, "Migrated from: "),
    migrated_to = str_remove(migrated_to, "Migrated to: ")
  ) %>%
  #drop unnecessary columns
  select(name:migrated_to, gender:address) %>%
  #keep distinct addresses per person. This prevents double counting and reduces the query time for the geocoder.
  distinct(address, .keep_all = TRUE) %>%
  ungroup()
Part 2: Geocoding
Find distinct addresses
Geocoding the addresses can be quite time consuming. To save time, we can run only the distinct addresses and then join these back to partition_distinct_locations afterwards.
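The chunk for this step is not shown, so the following is only a sketch of one way to collect the distinct addresses. The toy data frame stands in for the real one, and the name `distinct_addresses` is an assumption.

```r
library(dplyr)

# Toy stand-in for partition_distinct_locations (not real data)
partition_distinct_locations <- tibble(
  name = c("Person A", "Person B", "Person C"),
  address = c("Lahore, Pakistan", "Lahore, Pakistan", "Amritsar, India")
)

# One row per unique address, so each address is sent to the geocoder once
distinct_addresses <- partition_distinct_locations %>%
  distinct(address)
```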
Run geocoder
The script relies on the tidygeocoder package developed by Jesse Cambon, Diego Hernangómez, Christopher Belanger, and Daniel Possenriede. The package allows users to select the geocoder of their choice. For ease of reproducibility, OpenStreetMap (osm) was selected, though other services that require registration or login might be more accurate. This process is time consuming and has been commented out. The address file has been cached.
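Since the original chunk is commented out, here is a hedged sketch of how a cached geocoding step might look. The helper `geocode_with_cache()` and the path `data/address_cache.csv` are hypothetical. The call to `tidygeocoder::geocode()` with `method = "osm"` follows the package's documented interface.

```r
library(dplyr)

# Hypothetical cache location; the actual cached file name is not given here
cache_path <- "data/address_cache.csv"

geocode_with_cache <- function(addresses, path = cache_path) {
  # Reuse cached coordinates when the file already exists
  if (file.exists(path)) {
    return(readr::read_csv(path))
  }
  # Otherwise query OpenStreetMap (Nominatim); this is slow and rate limited
  result <- addresses %>%
    tidygeocoder::geocode(
      address = address,
      method = "osm",
      lat = latitude,
      long = longitude
    )
  readr::write_csv(result, path)
  result
}

# geocode_with_cache(distinct_addresses)  # commented out: makes network requests
```

Writing the result to disk once means later runs skip the network entirely, which matches the caching approach described above.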
Join coordinates to partition_distinct_locations
The coordinates are joined to the existing dataframe partition_distinct_locations.
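A minimal sketch of the join, assuming the geocoded table shares an `address` column with the main data frame. The data frames below are toy stand-ins with approximate coordinates, and the names `geocoded_addresses`, `partition_geolocations`, and `unmatched` are assumptions.

```r
library(dplyr)

# Toy stand-ins (not real data); coordinates are approximate
partition_distinct_locations <- tibble(
  name = c("Person A", "Person B", "Person C"),
  address = c("Lahore, Pakistan", "Amritsar, India", "Unknown Village, Punjab")
)
geocoded_addresses <- tibble(
  address = c("Lahore, Pakistan", "Amritsar, India"),
  latitude = c(31.55, 31.63),
  longitude = c(74.34, 74.87)
)

# Attach coordinates to every row that shares an address
partition_geolocations <- partition_distinct_locations %>%
  left_join(geocoded_addresses, by = "address")

# Addresses the geocoder missed keep NA coordinates and need manual correction
unmatched <- partition_geolocations %>%
  filter(is.na(latitude))
```

Using a left join keeps every narrative row, so the unmatched addresses remain visible for the manual step described next.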
Clean final table
The geocoder will not necessarily catch all locations. Some of the locations have to be geocoded and corrected manually. This process is involved, and has to be done through QGIS. Several additional fields were created to keep track of the changes:
known - Whether the location was ultimately found. FALSE indicates that the location is a best guess
camp - Indicates that this was a refugee camp. This data was not used
resolved_location - The final location name for the coordinate. There may be a discrepancy between this and the initial address
admin - Indicates whether this is a larger administrative area within which other locations fall. It also includes rivers. Admin areas are dropped from analysis because they are redundant. Likewise, as the relevant position along a river is often unknowable, rivers were dropped too.
Part 3: Statistical Overview
Import clean data
Read in the data file partition_geolocations_clean. This file is read-only to prevent accidental file corruption.
Aggregate location totals
Generate a table for all analysis: only include non-administrative areas, unique locations for each person, counts per person, and total counts per location.
partition_statistics <- partition_clean %>%
  rename(latitude=9) %>%
  rename(longitude=10) %>%
  filter(admin == FALSE) %>%
  filter(occupation!="No") %>%
  mutate(PersonID = paste(name,"_",age)) %>%
  group_by(PersonID) %>%
  distinct() %>%
  add_count(PersonID, name = "loc_by_name") %>%
  ungroup() %>%
  add_count(resolved_location, name = "loc_total")
General Overview
#Get number of unique locations
unique_locations <- partition_statistics %>%
  ungroup() %>%
  select(resolved_location) %>%
  distinct() %>%
  nrow()
#Get number of unique people
unique_people <- partition_statistics %>%
  ungroup() %>%
  select(PersonID) %>%
  distinct() %>%
  nrow()
#Calculate mean locations mentioned
mean_locations <- partition_statistics %>%
  ungroup() %>%
  summarize(mean_locations = mean(loc_by_name))
There are 768 unique locations in the data set. These are distributed across 320 people. On average, each person mentions 9.49 locations.
Locations by gender
Broken down by gender, the mean number of locations mentioned by men (9.88) is higher than that mentioned by women (8.67).
mean_locations_gender <- partition_statistics %>%
  group_by(gender) %>%
  summarize(mean_gender = round(mean(loc_by_name), 2))
mean_locations_gender %>%
  addHtmlTableStyle(col.rgroup = c("none", "#F5FBFF")) %>%
  htmlTable(header = c("Gender", "Mean Locations Mentioned"))
| | Gender | Mean Locations Mentioned |
|---|---|---|
| 1 | Female | 8.67 |
| 2 | Male | 9.88 |
Locations by occupation
A similar trend emerges when accounting for occupation. Here, people who mention their occupation mention more locations.
mean_locations_occupation <- partition_statistics %>%
  group_by(occupation) %>%
  summarize(mean_occupation = round(mean(loc_by_name), 2))
mean_locations_occupation %>%
  addHtmlTableStyle(col.rgroup = c("none", "#F5FBFF")) %>%
  htmlTable(header = c("Occupation", "Mean Locations Mentioned"))
| | Occupation | Mean Locations Mentioned |
|---|---|---|
| 1 | Not Mentioned | 8.06 |
| 2 | Yes | 9.92 |
Locations by occupation and gender
The contrast in locations mentioned becomes even starker when the values are disaggregated by both gender and whether an occupation is mentioned.
mean_locations_occ_gen <- partition_statistics %>%
  group_by(gender, occupation) %>%
  summarize(mean_location = round(mean(loc_by_name), 2))
mean_locations_occ_gen %>%
  addHtmlTableStyle(col.rgroup = c("none", "#F5FBFF")) %>%
  htmlTable(header = c("Gender", "Occupation", "Mean Locations Mentioned" ))
| | Gender | Occupation | Mean Locations Mentioned |
|---|---|---|---|
| 1 | Female | Not Mentioned | 8.24 |
| 2 | Female | Yes | 9.28 |
| 3 | Male | Not Mentioned | 7.19 |
| 4 | Male | Yes | 10.05 |
Percent mention of occupation by gender
Men mentioned their occupations far more often than women (92% versus 38%).
partition_statistics %>%
  group_by(gender) %>%
  select(PersonID, gender, occupation) %>%
  distinct() %>%
  count(occupation) %>%
  mutate(percent = paste(round(n/sum(n),2)*100,"%")) %>%
  addHtmlTableStyle(col.rgroup = c("none", "#F5FBFF")) %>%
  htmlTable(header = c("Gender", "Occupation", "Number of People", "Percent" ))
| | Gender | Occupation | Number of People | Percent |
|---|---|---|---|---|
| 1 | Female | Not Mentioned | 70 | 62 % |
| 2 | Female | Yes | 43 | 38 % |
| 3 | Male | Not Mentioned | 17 | 8 % |
| 4 | Male | Yes | 192 | 92 % |
Distribution of locations mentioned
The distribution of locations mentioned shows that men who do not mention an occupation have a negligible impact on the mean number of locations mentioned. Meanwhile, the number of women who do not mention an occupation is substantial, and they tend to mention fewer locations. Even among those who mention their occupation, the men’s distribution has a longer tail.
partition_statistics %>%
  #One row per person so that each person is counted once
  distinct(PersonID, gender, occupation, loc_by_name) %>%
  ggplot(aes(loc_by_name, fill = gender)) +
  #alpha controls transparency; geom_histogram() has no opacity parameter
  geom_histogram(
    color = "black",
    alpha = .4,
    position = "identity"
  ) +
  scale_fill_brewer(palette = "Pastel2") +
  labs(title = "Histogram of Mentioned Locations by Occupation and Gender",
       x = "Number of Locations Mentioned",
       y = "Number of People",
       fill = "Gender") +
  facet_wrap(~ occupation) +
  theme_classic()
Figure 1: Locations Mentioned by Occupation and Gender
T-test and ANOVA score
#tidy() comes from broom, which is installed with the tidyverse but not attached by library(tidyverse)
library(broom)
#Generate t-scores
gender_ttest <- t.test(loc_by_name ~ gender, partition_statistics)
occupation_ttest <- t.test(loc_by_name ~ occupation, partition_statistics)
#Create dataframes of tscores
tscores <- map_df(list(gender_ttest, occupation_ttest), tidy)
tscores <- tscores[c("p.value")]
#Generate variables for ANOVA
gender_occupation <- partition_statistics %>%
  unite("gender_occupation", gender:occupation, remove = FALSE)
anova_gender_occupation <-
  aov(loc_by_name ~ gender_occupation, gender_occupation)
#Create dataframe for ANOVA score
anovascore <- map_df(list(anova_gender_occupation), tidy)
anovascore <- anovascore[c("statistic", "p.value")]
T-tests on gender and on occupation individually affirm what visual inspection already suggests: the differences in means are unlikely to be due to chance. A Welch Two Sample t-test was run both on the difference in mean locations by gender (p = 6.1e-16) and on the difference in mean locations by occupation (p = 8.2e-34). At the same time, an analysis of variance (ANOVA) across the combined gender and occupation groups yields an F-score of 44.42 and a p-value of 5.7e-28, indicating that the variance between group means is greater than the variance within groups.
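For readers unfamiliar with the calls above, the following toy example shows the same pattern on simulated data using base R only. The groups and counts are simulated and are not the Partition data. It illustrates that `t.test()` with a formula runs the Welch (unequal variance) test by default, and how the F statistic and p-value can be read from an `aov()` fit.

```r
set.seed(1)

# Simulated counts for two hypothetical groups (not the Partition data)
toy <- data.frame(
  group = rep(c("A", "B"), each = 50),
  count = c(rpois(50, lambda = 8), rpois(50, lambda = 10))
)

# Welch two-sample t-test: var.equal = FALSE is the default in t.test()
welch <- t.test(count ~ group, data = toy)
welch_p <- welch$p.value

# One-way ANOVA on the same grouping; summary() holds the F statistic and p-value
fit <- aov(count ~ group, data = toy)
anova_table <- summary(fit)[[1]]
f_value <- anova_table[["F value"]][1]
anova_p <- anova_table[["Pr(>F)"]][1]
```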
Part 4: Spatial Analysis
The spatial analysis of the data set was done with QGIS. As these manipulations are hard to document, only their result is shown. There were a number of cases where tidygeocoder did not catch all of the locations. These had to be added manually.
Location diversity
Departure locations
#Departure locations
part_from <- partition_statistics %>%
  mutate(migrated_from = str_extract(migrated_from, "[^,]+")) %>%
  drop_na(migrated_from) %>%
  select(PersonID, migrated_from, gender) %>%
  distinct(PersonID, migrated_from, gender) %>%
  add_count(migrated_from, name = "total_location") %>%
  group_by(gender) %>%
  add_count(migrated_from, name = "loc_gender") %>%
  add_count(gender, name = "gender_tot") %>%
  mutate(percent = loc_gender / gender_tot) %>%
  select(-PersonID,-loc_gender,-gender_tot) %>%
  distinct() %>%
  #ungroup() %>%
  arrange(desc(total_location), gender) %>%
  top_n(5, total_location) %>%
  #percent() is not provided by the packages listed above; presumably scales::percent()
  mutate(percent = percent(percent,2)) %>%
  select(-total_location)
#Number of departure locations by gender
gender_migration <- partition_statistics %>%
  drop_na(migrated_from) %>%
  distinct(migrated_from, gender) %>%
  group_by(gender) %>%
  count(gender)
The first thing that is notable about the departure locations is their diversity. While a plurality of both women (30%) and men (22%) departed from Lahore, followed by Rawalpindi (12% and 8%), many departed from quite different locations. In fact, women departed from 43 different locations, while men departed from 89.
This diversity of points of origin is visible in a spatial representation of the direct lines of travel to Delhi.
Figure 2: Departure locations during Partition
| | Migrated From | Gender | Percent departure by Gender |
|---|---|---|---|
| 1 | Lahore | Female | 30% |
| 2 | Lahore | Male | 22% |
| 3 | Rawalpindi | Female | 12% |
| 4 | Rawalpindi | Male | 8% |
| 5 | Multan | Female | 4% |
| 6 | Multan | Male | 4% |
| 7 | Faisalabad | Female | 6% |
| 8 | Faisalabad | Male | 4% |
| 9 | Dera Ismail Khan | Female | 6% |
| 10 | Dera Ismail Khan | Male | 2% |
Table 3: Top 5 Departure locations by gender
Transit Locations
partition_transfer <- partition_statistics %>%
  #Filter out Delhi as a final location
  filter(resolved_location != "Delhi") %>%
  #Clean up the migrated from and migrated to data
  mutate(migrated_from = str_extract(migrated_from, "[^,]+")) %>%
  mutate(migrated_to = str_extract(migrated_to, "[^,]+")) %>%
  #Remove all cases where the migrated from location is the same as one of the transit locations
  filter(migrated_from != resolved_location) %>%
  #Remove all cases where the resolved location equals migrated to.
  filter(migrated_to != resolved_location) %>%
  #Get the number of transfer locations based on where people migrated from and their gender
  group_by(gender) %>%
  mutate(total_gender = n_distinct(PersonID)) %>%
  group_by(migrated_from, gender) %>%
  add_count(resolved_location, name = "migration_location", sort = TRUE) %>%
  #Calculate the percentage as a share of all respondents of that gender
  mutate(percent_transit = migration_location / total_gender) %>%
  #Clean up table for presentation
  select(migrated_from,
         gender,
         resolved_location,
         migration_location,
         percent_transit) %>%
  distinct(migrated_from, resolved_location, percent_transit) %>%
  arrange(desc(percent_transit), resolved_location) %>%
  ungroup() %>%
  top_n(10, percent_transit) %>%
  mutate(percent_transit = percent(percent_transit, 2)) %>%
  relocate(gender, .before = migrated_from)
Likewise, the transit locations were quite diverse. Amritsar occurs most frequently for both women (10%) and men (8%), but it does not account for a majority of transit locations.
#Generate table
partition_transfer %>%
  addHtmlTableStyle(col.rgroup = c("none", "#F5FBFF")) %>%
  htmlTable(header = c(
    "Gender",
    "Migrated From",
    "Transfer",
    "Percent of Respondents Transferred"
  ))
| | Gender | Migrated From | Transfer | Percent of Respondents Transferred |
|---|---|---|---|---|
| 1 | Female | Lahore | Amritsar | 10% |
| 2 | Male | Lahore | Amritsar | 8% |
| 3 | Female | Rawalpindi | Lahore | 6% |
| 4 | Female | Lahore | Rawalpindi | 6% |
| 5 | Female | Lahore | Anarkali Bazaar | 4% |
| 6 | Female | Lahore | Shimla | 4% |
| 7 | Male | Lahore | Rawalpindi | 4% |
| 8 | Female | Faisalabad | Amritsar | 4% |
| 9 | Female | Rawalpindi | Karol Bagh | 4% |
| 10 | Female | Lahore | Mumbai | 4% |
| 11 | Female | Lahore | Mussoorie | 4% |
The spatial analysis requires several manipulations of the data that were done in QGIS. What follows is a brief outline.
Create from Locations
Note: the geocoding process is skipped for the purposes of this notebook
- Subset the data into from hubs.
hub_from <- partition_statistics %>%
  select(migrated_from) %>%
  drop_na() %>%
  filter(migrated_from!="TBA") %>%
  distinct()
- Geocode each from hub.
- Attach data back to from_hubs.
hub_from_join <- hub_from_geo %>%
  rename(migrated_from = address)
hub_from_join <- partition_statistics %>%
  left_join(hub_from_join)
hub_from_join <- hub_from_join %>%
  select(name, age, migrated_from, gender, occupation, lat, long) %>%
  filter(migrated_from != "TBA") %>%
  drop_na(migrated_from) %>%
  distinct()
- Write from_hub file for geoprocessing. osm will not necessarily catch all locations. Some of these have to be hand coded.
- Create line geometry for departure locations to Delhi.
hub_from_join_clean <- hub_from_join_clean %>%
  mutate(WKT = paste("LINESTRING(",long," ", lat, ",", "","77.2219388","28.6517178)"))
write_csv(hub_from_join_clean, "data/hubs_to_delhi.csv")
- Measure distance from from_hub to Pakistan border using the NNjoin plugin for QGIS.
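The line geometry step can also be written with `sprintf()`, which produces WKT without the extra spaces that `paste()` inserts between its arguments. This is an illustrative alternative under stated assumptions, not the code used for the essay: the `hubs` table and its approximate coordinates are toy values, while the Delhi coordinates are those used in the script.

```r
library(dplyr)

# Delhi hub coordinates as used in the script (longitude, latitude)
delhi_long <- 77.2219388
delhi_lat <- 28.6517178

# Toy departure hubs with approximate coordinates (illustrative only)
hubs <- tibble(
  migrated_from = c("Lahore", "Rawalpindi"),
  long = c(74.34, 73.05),
  lat = c(31.55, 33.60)
)

# WKT expects "x y" pairs, i.e. longitude first, then latitude
hubs_wkt <- hubs %>%
  mutate(WKT = sprintf("LINESTRING(%s %s, %s %s)",
                       long, lat, delhi_long, delhi_lat))
```

The resulting `WKT` column can be written to csv and loaded in QGIS as a delimited text layer with WKT geometry.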
Evaluating distance to border
#Get the mean distance by gender
group_mean <- distance_to_border %>%
  group_by(gender) %>%
  summarise(grp_mean = mean(distance_km),
            group_median = median(distance_km))
#Get percentage of people who travelled more than 100km
more_than_100 <- distance_to_border %>%
  mutate(n = n()) %>%
  filter(distance_km > 100) %>%
  summarise(more_than = n() / n) %>%
  distinct()
The path of travel to the border was long for the majority of interviewees. Men and women both traveled more than 100km on average, and the median distance also exceeded 100km (women = 105km, men = 128km). Although 100km is a rather arbitrary threshold, the majority of people (59%) traveled more than 100km to get to the border. The sense that most interviewees traveled quite far even to reach the border is also borne out in the distribution of distances traveled.
distance_to_border %>%
  group_by(gender) %>%
  ggplot(aes(distance_km, fill = gender)) +
  #alpha controls transparency; geom_histogram() has no opacity parameter
  geom_histogram(
    color = "black",
    alpha = .4,
    position = "identity"
  ) +
  scale_color_brewer(palette = "Pastel2") +
  scale_fill_brewer(palette = "Pastel2") +
  labs(title = "Histogram of Distance to Border by Gender",
       x = "Distance in km",
       y = "Count",
       fill = "Gender") +
  facet_wrap(~ gender) +
  theme_classic() +
  #Dashed vertical line at the mean distance for each gender
  geom_vline(data = group_mean,
             aes(xintercept = grp_mean, color = gender),
             linetype = "dashed") +
  theme(legend.position = "none") +
  geom_text(data = group_mean,
            aes(
              x = grp_mean,
              y = 0,
              label = paste("Mean Distance = ", round(grp_mean, 0), "km"),
              hjust = -.05,
              vjust = -22
            ))
Figure 3: Distribution of Distance to Border
Analyzing Hub and Spokes Model
Using QGIS, it is possible to take all of the locations in each narrative and attach them to a central hub, in this case Delhi.
to_hubs <- partition_statistics %>%
  filter(resolved_location!="Delhi") %>%
  mutate(WKT = paste("LINESTRING(",longitude," ", latitude, ",", "","77.2219388","28.6517178)")) %>%
  select(name,age,gender,occupation,resolved_location,migrated_from,migrated_to,PersonID,loc_by_name,loc_total,WKT)
write_csv(to_hubs, "data/hub_and_spoke.csv")
Figure 4: Locations Mentioned in the Interviews